New method for taxonomic descriptions with coded notation, producing dynamic and interchangeable output

Abstract A notation for taxonomic species descriptions is proposed that replaces the traditional descriptive text with a coded matrix, avoiding redundant adjectives and subjective descriptions. It is an attempt to increase the species description rate and to make the description output available to other scientific disciplines, machine learning, interactive and computer-assisted identification keys, metadata analysis and its applications. The method presents the description of the overall morphology in a coded matrix that follows a character list with the detailed observed conditions for each character. The method is dynamic and open to amendments and to the addition of new data as they become available. We test the new method by describing five new species of Collembola Symphypleona of the genus Pararrhopalites as a generalized model, and we make the coded output available. We conclude that a coded taxonomic description is an advance over the traditional taxonomic text, with potential to increase the global description rate. The generated descriptions are dynamic, expandable and easily used in other fields of science, allowing non-experts to access the data for phylogenetic, biogeographic and ecological studies and for metadata analysis. Even though an experienced taxonomist will always be necessary to make a detailed taxonomic description, the method is a step towards a general template for semi-automated taxon recognition and towards the future development of auxiliary tools for species description, using machine learning and templates to speed up the time-consuming preparation of schematic figures after the expert interpretations are done.


| INTRODUCTION
Taxonomy has been the focus of debate since the nineteenth century, and even recently the recognition of taxonomic research has been a subject of discussion (Packer et al., 2018; Zeppelini et al., 2021). The global biodiversity crisis exposes the urgency of investment in taxonomy to reveal the largely unknown species diversity. Taking Collembola as a reference, about 20% of the estimated diversity is known (Hopkin, 1997) and between 100 and 120 new species are described each year; at this rate, it would take taxonomists more than 400 years to uncover and describe all the unknown species diversity (Potapov et al., 2020). To understand the diversification processes in Collembola, we need to speed up the rate of species description. This is a matter of concern in every area of entomology and, to some extent, in the whole of zoology.
Similar to many other taxonomic groups of the meso- and microfauna, Collembola taxonomy is largely based on morphological analysis, observing and describing discrete variations in diagnostic characters. The most abundant morphological source of information for species definition in Collembola is the number, distribution, and shape of the cuticular chaetae, known as chaetotaxy. The current morphological approaches for inference of homology, the chaetotaxic systems for chaetal identification, often leave room for great subjectivity, depending on what is visible under an optical microscope, and different chaetotaxy systems are often hardly comparable (Betsch, 1997; Betsch & Waller, 1994; Bretfeld, 1990, 1999; Potapov et al., 2020). The challenges and perspectives for Collembola taxonomy have been discussed in detail, and the need for an integrative taxonomy and for international efforts to direct financial support and recognition of expertise to face the global biodiversity crisis has also been the focus of debate (Potapov et al., 2020; Zeppelini et al., 2021).
Recent technologies of high-resolution imaging, molecular sequencing and machine learning will contribute greatly to taxonomic techniques that can improve the recognition of new and known taxa (Potapov et al., 2020). Integrative taxonomy, combining morphological and molecular data to define species limits, is likely to be a trend for most taxonomic groups, not only Collembola.
There is, however, a particular aspect of Collembola (and of nearly every taxon of the meso- and microfauna) that, in many if not most cases, affects the viability of including molecular sequences in new species descriptions. It is a logistic problem, but often there is no alternative. Almost all new species are discovered under the light microscope, which means that the specimen was mounted on a slide after being cleared by one of several chemical washing techniques, which destroy the tissues and, consequently, the genetic material.
It is only after the taxonomic identification that a species is recognized as new to science or undescribed. More often than not, the material analyzed is a limited set of specimens, and no material remains available for molecular analysis after the taxonomic identification and morphological study if the extraction of DNA/RNA was not performed before mounting the specimens on slides for microscopy.
Even assuming that molecular analysis facilities are available, the biological specimens needed for molecular sequencing may often become available only in the future, after the species is described. Even when scanning electron microscopy (SEM) is possible, depending on the structure, it is hard to get images of all diagnostic features, and light microscopy may be needed as well. Nevertheless, high-resolution imaging and molecular data are powerful tools and may be indispensable for accurate taxonomic research and species delimitation.
Therefore, morphological descriptions must be dynamic, open to easy amendment and to the insertion of additional data. Furthermore, they must be presented in an interchangeable language, to allow the information to flow across different disciplines. Among all methods applied to the study of the external morphology of Collembola, chaetotaxy is certainly the most complex and extensively detailed (Betsch & Waller, 1994; Cassagnau, 1974; Deharveng, 1983; Fjellberg, 1999; Jordana & Baquero, 2005; Nayrolles, 1988, 1990a, 1990b; Potapov, 2001; Szeptycki, 1972, 1979; Yosii, 1960). There are many chaetae and groups of chaetae that vary in position and shape in such a way that they allow a great deal of homology inference.
However, the most advanced approaches are also very complex, which makes interpretation difficult and increases ambiguity. These aspects restrict in-depth taxonomic research to small groups of experts, posing difficulties for comparative studies even among different orders of Collembola. In addition, the traditional descriptive texts with morphological and chaetotaxic information are difficult to integrate with machine learning and other computational advances, which could greatly speed up phylogenetic analysis, metadata comparison, biogeography, and their various applications (Potapov et al., 2020). Despite all advances in technological instruments and methods, taxonomic descriptions are still written in basically the same format as about two centuries ago, in a hermetic language that is nearly incomprehensible to non-experts. This is often a greater barrier to communication among different areas of science than access to high-tech equipment and analytical facilities.
The proposal of a coded and illustrated description of new species that can be easily imported, transformed, amended, corrected, or expanded is presented as an alternative to the traditional descriptive taxonomic method.
The strength of the coded description is that new characters, whether morphological, molecular, or ecological, can be easily added to the list and can improve the descriptive matrix as new information is produced. These matrices can be uploaded to public libraries, kept up to date with all available information about the species, and linked to databases such as GBIF and ZooBank and to electronic taxonomic catalogs available in different parts of the world, e.g., fauna.jbrj.gov.br/fauna/listaBrasil (Zeppelini et al., 2023) and www.collembola.org (Bellinger et al., 1996).
Finally, the new proposition for taxonomic notation will not dismiss the need for an experienced taxonomist: the pre- and post-descriptional elements (e.g., type material, habitat, distribution, remarks) and all the analytical study (morphological, molecular) will always depend on the expertise of the researcher. Nevertheless, the identification phase can be automated, and the preparation of schematic figures, a very time-consuming phase of the whole description work, can be sped up with templates for each taxonomic group that pop up as the data matrix is filled in. This may allow taxonomists to enhance their productivity, increasing the species description rate.

| Coded taxonomic description
The order Symphypleona Börner, 1901 shows some ambiguity under current morphological methods, particularly in the description of the head and body chaetotaxy (Betsch, 1997; Betsch & Waller, 1994; Bretfeld, 1990, 1999; Christiansen & Bellinger, 1998). The order is composed of springtails with a globular body shape, the result of modification and fusion of thoracic and abdominal segments I-IV; this condition hinders the direct assignment of segment identity.
An approach that can reduce the ambiguity of taxonomic descriptions is to describe the body parts as coded morphological units that straightforwardly represent the actual body segments and appendicular whorls (Hopkin, 1997; Jura et al., 1987; Nayrolles, 1988, 1990a, 1990b, 1991; Tomizuka & Machida, 2015), in such a way that any species can be compared from the coded database. This replaces the traditional descriptive text, which often uses ambiguous terminology and applies different, not directly comparable chaetotaxic systems. The coded notation method would lead to a more comprehensive analysis of the chaetotaxy, as well as direct availability of the data for comparative studies. Furthermore, a coded description can easily be amended: molecular data can be added to the character list and matrix, and new complementary morphological features can be inserted as new information becomes available.
The qualitative description of the shape and size of the different chaetae is also subject to a great deal of ambiguity and poor definition: the adjectives are not standardized, and the very definition of a macro-, meso- or microchaeta is not always clear.
Therefore, a bank of shapes with high-quality images is imperative to discard all subjective descriptions. Several chaetae banks have been published for different groups of Collembola, some with precise line drawings (Betsch, 1980; Christiansen, 1966; Deharveng, 1983; Nayrolles, 1991) and some with SEM photography (Cipola et al., 2020; de Lima et al., 2022; Lukić et al., 2010; Zeppelini et al., 2022; Zhang & Deharveng, 2015). It is a matter of time until a fully reliable collection of chaetal shapes exists, so that a specific chaeta can be addressed in the coded description directly by its reference in the bank.
A standard, fully coded method for species description may be an improvement over the traditional descriptive text. It may allow the use of machine learning and high-quality imaging to enhance the efficiency of species description and diversity recognition, offering a powerful tool to understand the global processes of diversification and distribution and to face the biodiversity decline.

| Chaetal fields and morphological units
We attempt to assess the chaetotaxy of the head and great abdomen of Collembola by identifying the body segments arranged in each tagma (Figures 1 and 2), or the whorls in each appendage (Figures 3-5) (Hopkin, 1997; Jura et al., 1987; Nayrolles, 1988, 1990a, 1990b, 1991; Tomizuka & Machida, 2015). Each segment has its own set of chaetae, and more than one chaetal field may be observed in a single body segment. A chaetal field is a group of associated chaetae that are consistently observed in a given body segment, often associated with some landmark on the cuticle (Figures 6 and 7).
The morphological units are the actual observed characters in a given species. Once all the chaetal fields are recognized, the characters are listed, and their inherent character states are described.
Each observed condition of a morphological unit in the character list is given a code (0, 1, 2). It is important to note that this is not a phylogenetic matrix, since the codes in the resulting matrix are not supposed to include hypotheses of ordering or polarity, and both apomorphies and plesiomorphies may be listed as characters; instead, it is a descriptive coded character state matrix.
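The coding step can be illustrated with a minimal Python sketch. The unit names (`eyes`, `ant_iv_subsegments`) and the code assigned to each condition are hypothetical and chosen only for illustration; the observed conditions themselves (1+1 or 0+0 eyes; eight, nine or 10 subsegments of Ant. IV) are taken from the species remarks in this paper, and the actual codes are those of the character list in the Appendix.

```python
# Hypothetical character list: each morphological unit maps its observed
# conditions to integer codes. The codes are labels only; they imply no
# ordering or polarity (this is not a phylogenetic matrix).
CHARACTER_LIST = {
    "eyes": {"1+1": 0, "0+0": 1},
    "ant_iv_subsegments": {"8": 0, "9": 1, "10": 2},
}


def encode(unit: str, observed: str) -> int:
    """Return the code of an observed condition of a morphological unit."""
    states = CHARACTER_LIST[unit]
    if observed not in states:
        raise KeyError(f"{observed!r} is not a listed condition of {unit!r}")
    return states[observed]


def decode(unit: str, code: int) -> str:
    """Return the observed condition that a code stands for."""
    for observed, listed_code in CHARACTER_LIST[unit].items():
        if listed_code == code:
            return observed
    raise KeyError(f"code {code} is not used for {unit!r}")
```

Because the codes are plain labels, `decode` is a simple reverse lookup, and a reader without taxonomic training can recover the observed condition from the matrix and the character list alone.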

| Head, body, and appendages chaetotaxy
Chaetotaxic systems attempt to label each chaeta along the body, where the label indicates a specific chaeta and its position on the body of the animal; a qualitative description is then made (e.g., spine-like chaeta, macrochaeta, club-shaped sensillum, palmate, serrate, lanceolate). This brings subjectivity both in the interpretation of the label of a specific chaeta and in the many different adjectives that can apply to a given shape, depending on the author.
Here we map regions that can be compared across different taxonomic groups, the chaetal fields, within each head and body segment (Figures 1 and 2) and on the appendages (Figures 3-5), and we refer the shape of each chaeta to an image dataset, the chaetae bank, with images of each kind of chaeta found in the taxon (Figure 8).
After the chaetotaxy is analyzed with the traditional systems, the labeling of the chaetae is replaced by a code describing the total number of chaetae in the chaetal field, and the qualitative description of the different kinds of chaetae found in each chaetal field is replaced by the respective number in the chaetae bank that represents the actual observed shape and size, to compose the morphological unit definition (see Table 1).
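As a rough sketch of this replacement, a chaetal field can be summarized by the total number of chaetae and by how many chaetae of each chaetae-bank number it contains, and each distinct combination can then receive a numeric code. The bank numbers used below (3 and 7) and the rule of assigning codes in order of first observation are assumptions made for illustration, not the procedure of the Appendix.

```python
from collections import Counter


def morphological_unit(bank_numbers):
    """Summarize a chaetal field as (total chaetae, ((bank number, count), ...)).

    `bank_numbers` holds one chaetae-bank number per chaeta observed in the
    field; the order in which the chaetae were recorded does not matter.
    """
    counts = Counter(bank_numbers)
    return (len(bank_numbers), tuple(sorted(counts.items())))


class CharacterStates:
    """Assign a numeric code to each distinct combination of number and type."""

    def __init__(self):
        self._codes = {}

    def code_for(self, unit):
        # A new combination receives the next free code (0, 1, 2, ...);
        # a combination seen before keeps the code it already has.
        return self._codes.setdefault(unit, len(self._codes))


states = CharacterStates()
field_a = morphological_unit([3, 3, 7])  # two chaetae of shape 3, one of shape 7
field_b = morphological_unit([7, 3, 3])  # same combination, recorded in another order
field_c = morphological_unit([3, 3, 3])  # same total, different shapes
```

Here `field_a` and `field_b` receive the same code because only the combination of number and type matters, whereas `field_c` receives a new one.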

| Coded descriptive dataset
The whole collection of data for each species results in a dataset as synthesized in Table 1. This is the final outcome of the morphological description and represents the complete and up-to-date set of information for each studied species. However, it is dynamic and open to additional information when available.
The coded dataset is hierarchically ordered in four columns, namely Tagma, Segment, Chaetal field and Morphological unit (Table 1); a fifth column is inserted with the coded information of the species (Figure 1). The rows of the resulting dataset hold the different features of the chaetotaxy and general morphology (e.g., eyes, foot complex). The cells in the Morphological unit column are the actual features to be observed in the specimen; each one is a recognizable morphological unit of the whole morphology of the animal that is described and coded in the character list (Appendix).
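To show how such a dataset could be exchanged between tools, the sketch below writes the five columns to CSV text and reads them back with the Python standard library. The four hierarchical column names come from the text; the name of the fifth column (`Code`) and all row values are placeholders invented for the example and do not reproduce Table 1.

```python
import csv
import io

COLUMNS = ["Tagma", "Segment", "Chaetal field", "Morphological unit", "Code"]

# Placeholder rows: the values are illustrative only.
ROWS = [
    {"Tagma": "Head", "Segment": "segment A", "Chaetal field": "field 1",
     "Morphological unit": "eyes", "Code": 0},
    {"Tagma": "Great abdomen", "Segment": "segment B", "Chaetal field": "field 2",
     "Morphological unit": "dorsal posterior chaetae", "Code": 1},
]


def to_csv(rows):
    """Serialize the coded dataset to CSV text, one morphological unit per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into rows, restoring the code as an integer."""
    reader = csv.DictReader(io.StringIO(text))
    return [{**row, "Code": int(row["Code"])} for row in reader]
```

A plain tabular format of this kind is one possible choice; any format that keeps the hierarchy and the code in separate fields would serve the same purpose.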

| Testing the coded description
To test the proposed system, we describe five new species of the genus Pararrhopalites Bonet & Tellez, 1947 (zoobank.org:pub:9ED865EA-F95A-4CBE-947C-3A5C6CD81907), from the order Symphypleona, using Pararrhopalites fallaciosus sp. n. as the explanatory example in the SEM overall morphological analysis. First, we describe P. fallaciosus sp. n., where all the chaetal fields are delimited and the species is morphologically defined. The character list is derived from this revision and presented in the Appendix. The final descriptions of the remaining new species present the code only (Morphologic unit code).
Here we propose the coded description for Pararrhopalites, a genus of Symphypleona; however, once fully established, the system must be applicable to the orders Poduromorpha and Entomobryomorpha as well. Ideally, a similar approach could be used for any other zoological taxon.
The chaetotaxic systems were used to verify the congruence of the morphological landmarks and the associated groups of chaetae that display constant expression (i.e., chaetal fields). All chaetotaxic information, that is, the actual observed character condition, was described in the character list and coded accordingly (Appendix).
The descriptions are presented as a list of coded characters in the descriptive plate of each newly described species. The exception is P. fallaciosus sp. n., where the morphological units are described in the matrix corresponding to the chaetotaxy system cited above, as an example of what the observed features are (before coding). The detailed chaetotaxy analysis for all five species described here is available in Data S1.

| Habitat and distribution
The species was collected in drilling holes, occurring in the Subterranean Shallow Habitat (SSH), its known distribution is restricted to the type locality, despite the sampling efforts in nearby  Good's Biogeographic zone 27 (Culik & Zeppelini Filho, 2003;Good, 1974).The climate according to Köppen's system is As (de Sá Júnior et al., 2012;Köppen, 1936;Shear, 1966), presenting dry winters and wet summers, average temperatures of 18°C during winter and 22°C in summer (valid for all the five species described here).

| Remarks
The new species resembles P. queirozi in the shape of the subanal appendage and in the cephalic chaetotaxy but can be clearly distinguished by the number of subsegments of Ant. IV (eight in P. queirozi, 10 in the new species), and by the presence of an inner tooth on all ungues and of an apical filament exceeding the tip of the unguis in all three empodial complexes in P. queirozi. The new species is similar to Pararrhopalites hermesi sp. n. in the shape of the subanal appendages and in the lack of an inner tooth on all ungues. They differ in the number of eyes (1+1 in P. fallaciosus sp. n. and 0+0 in P. hermesi sp. n.), in the number of subsegments of Ant. IV (10 and nine, respectively), and in the lack of a corner tooth on all unguiculi and the smooth inner lamella of the mucro in P. hermesi sp. n.
The reduced number of chaetae on the dorsal posterior part of the great abdomen (17+17) also differentiates P. fallaciosus sp. n. from the other species of the queirozi-group (all the species that share the same female subanal appendages).
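A comparison of this kind can in principle be derived mechanically once the descriptions are stored as data. The sketch below records a few of the conditions stated in these remarks for P. fallaciosus sp. n. and P. hermesi sp. n. and lists the units in which they differ; the dictionary keys and the function name are hypothetical, and the observed conditions are used directly instead of the Appendix codes.

```python
# Observed conditions taken from the remarks; key names are illustrative.
P_FALLACIOSUS = {
    "eyes": "1+1",
    "ant_iv_subsegments": 10,
    "unguis_inner_tooth": "absent",
    "subanal_appendage_shape": 28,  # chaetae bank number (Figure 8)
}
P_HERMESI = {
    "eyes": "0+0",
    "ant_iv_subsegments": 9,
    "unguis_inner_tooth": "absent",
    "subanal_appendage_shape": 28,
}


def differing_units(a, b):
    """Return {unit: (condition in a, condition in b)} for shared units that differ."""
    shared = a.keys() & b.keys()
    return {unit: (a[unit], b[unit]) for unit in sorted(shared) if a[unit] != b[unit]}


def shared_units(a, b):
    """Return the units recorded in both descriptions with the same condition."""
    return sorted(unit for unit in a.keys() & b.keys() if a[unit] == b[unit])


diagnosis = differing_units(P_FALLACIOSUS, P_HERMESI)
similarities = shared_units(P_FALLACIOSUS, P_HERMESI)
```

Only units recorded for both species are compared, so a unit that is missing from one description is neither reported as a difference nor as a similarity.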

| Remarks
This species is part of a group with a specific kind of subanal appendage (number 28 in Figure 8), which includes P. queirozi and P. fallaciosus sp. n. The species has an intermediate number of subsegments on Ant. IV (nine), has no eyes, lacks the corner tooth on all unguiculi, and has a smooth inner lamella of the mucro. This combination of features easily differentiates the three species.
Despite its wide distribution, P. hermesi sp. n. presents some features that may indicate its relation to the SSH environment, for instance eye reduction, a shorter Ant. IV (nine subsegments), overall small body size, and the reduction of the corner tooth and apical filament of the unguiculus.

| Habitat and distribution
This species is restricted to a single area; there are only 10 records from three small caves in the same iron rock formation. The species is most likely distributed along the canga, an SSH formation resulting from the weathering of the iron rock, which often connects caves in the same lithology.

| Remarks
This species resembles P. sideroicus Zeppelini & Brito, 2014 and P. ubiquum Zeppelini et al., 2018 in the shape of the subanal appendages and in the general body chaetotaxy but differs from all the species of the genus with records in this area by lacking the interantennal sensillar triangle, a feature that seems to be shared by the species of both the queirozi-group and the ubiquum-group.
The presence of only three chaetae in the dorsal cephalic area DII is also very unusual for the genus and the reduced number of chaetae in the dorsal posterior part of the great abdomen can be diagnostic for this species.

| DISCUSSION
To assess the global species diversity, it is mandatory to increase the description rate of new taxa, mainly where the biodiversity is least known. A description protocol that communicates the morphological characters in a coded notation allows the application of new technologies and machine learning in research, which may be a major turning point for the discipline and affect species description rates. The scarcity of trained taxonomists and the hermetic nature of taxonomic description manuscripts are the biggest barriers to the advance of knowledge on species diversity and on the evolutionary processes of diversification, both key elements for understanding the global biodiversity decline.
In the study of Collembola, as in many other groups, the information content of a traditional taxonomic text is often difficult to access and cannot be transferred to analytical software without a detailed revision of the species description, which often demands an expert in the taxon. The traditional format is also almost impossible to use for machine learning, as the many differences in the presentation of the data can make comparison among different manuscripts impossible for non-experts and for artificial intelligence. The open character list of the coded description allows easy insertion and correction of information, and the character lists, the chaetae banks and the coded species descriptions are fully compatible with technologies that work with data matrices.
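As an illustration of this compatibility, the sketch below assembles coded descriptions of several species into a single species-by-character matrix and shows that adding a new character to one species does not invalidate the others. The codes, the character names and the `?` marker for a character not yet recorded are assumptions for the example only.

```python
def to_matrix(descriptions, missing="?"):
    """Build a species-by-character table from coded descriptions.

    `descriptions` maps a species name to {character: code}. A character not
    yet recorded for a species is filled with `missing`.
    """
    characters = sorted({c for coded in descriptions.values() for c in coded})
    header = ["species"] + characters
    rows = [
        [species] + [coded.get(c, missing) for c in characters]
        for species, coded in sorted(descriptions.items())
    ]
    return [header] + rows


# Hypothetical codes for two of the species described here.
descriptions = {
    "P. fallaciosus": {"eyes": 0, "ant_iv_subsegments": 2},
    "P. hermesi": {"eyes": 1, "ant_iv_subsegments": 1},
}
matrix_before = to_matrix(descriptions)

# Amendment: a new character (e.g., a molecular one) becomes available
# for one species only; the other species simply shows a missing entry.
descriptions["P. hermesi"]["hypothetical_new_character"] = 0
matrix_after = to_matrix(descriptions)
```

The matrix is rebuilt from the per-species descriptions each time, so an amendment is a change to one species record and never requires editing the rows of the others.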
Our results can be synthesized in the following conclusions:
1. The coded taxonomic description is a notation method that produces interchangeable data, fully available to different scientific disciplines. The data can be used by non-specialists for different purposes in science.
2. The method makes it possible to add any source of new data to the description when it becomes available. It is dynamic and open as a continuous list of characters; updating the knowledge of a given species does not depend on a traditional taxonomic revision.
3. The method allows machine learning that can help to speed up taxon identification and species description rates where they are least known. This can be an important tool to fight the global biodiversity crisis.
Figure 8. To the combination of chaetae number and type a numeric code is given (this example is coded 0 in the character list, Appendix).
areas and in the whole region, an important mining area which is being consistently sampled in the last decade.
| Habitat and distribution
This species is known from caves and SSH in a range over 200 km across different lithologies. It is a regionally widespread SSH species, but it is not abundant, as there are fewer than 20 records of this species.

FIGURE 10 Pararrhopalites hermesi sp. n. body chaetotaxy and descriptive table.

FIGURE 11 Pararrhopalites hermesi sp. n. antennal chaetotaxy and descriptive table.

FIGURE 16 Pararrhopalites atypicus sp. n. antennal chaetotaxy and descriptive table.

FIGURE 21 Pararrhopalites ritaleeae sp. n. antennal chaetotaxy and descriptive table.

4. Coded description is idealized to Collembola but must be applied to any taxonomic group, reducing the ambiguity of narrative

FIGURE 23 Pararrhopalites ritaleeae sp. n. abdominal appendages chaetotaxy. (a) Ventral tube, lateral view; (b) tenaculum, lateral view; (c) furcula chaetotaxy and mucronal lamellae. Solid circles: anterior view; hollow circles: posterior view.

FIGURE 24 Pararrhopalites ironicus sp. n. cephalic chaetotaxy and descriptive table.

FIGURE 26 Pararrhopalites ironicus sp. n. antennal chaetotaxy and descriptive table.

| Pararrhopalites ironicus sp. n.
Pararrhopalites ironicus sp. n. is a very restricted species, represented by only six records from two caves in the same iron rock formation, called Serra do Tamanduá; the caves are 2700 m away from each other and connected by SSH. It is a rare species, likely to be confined